Term Count on Search Results #3920
Comments
Might this requirement be similar to #3924? Also, I am curious: what is your use case?
Sorry I wasn't clear. On my search results page I have to return a word count of one of the fields for every search result. It happens to be my longest field. And it has to support scriptio continua languages, so I can't do something simple like count the number of spaces in my app and save that number to ES to retrieve with the search results. Elasticsearch already has a word count in the form of the per-field, per-document term vectors that I already store to use the FVH. Also, luckily, I process that field with an analyzer that doesn't add synonyms or funky word breaks. If I can ask Elasticsearch to count the terms in that field, that'll give me my word count. Anyway, what I need is pretty simple in comparison to the term vector API. I won't be listing terms and I only want term information for a single document. I also want it bundled in the search results so I don't have to make any additional requests. I'll send a pull request that implements this today or tomorrow, which should make it crystal clear.
Just for kicks, can you build a custom analyzer that consumes all tokens and returns the number of tokens in the field as a token, and then sort by it? You would need to parse the string, but it would work, no?
+1. Use cases can include faceting, scripted scoring, record linkage and whatnot. @s1monw all that is required is a custom TokenFilter really, but that filter doesn't have access to the IW / Document object, so you will need to go through the analysis chain twice.
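To make the "consume all tokens and emit the count as a token" idea concrete, here is a rough Python sketch. It is not Lucene code: `whitespace_tokenizer`, `count_filter` and `analyze_to_count` are hypothetical names, and the whitespace tokenizer is only a stand-in for whatever analyzer the field really uses (it would not handle scriptio continua text).

```python
from typing import Iterable, Iterator, List


def whitespace_tokenizer(text: str) -> Iterator[str]:
    """Stand-in tokenizer (assumption): splits on whitespace only."""
    yield from text.split()


def count_filter(tokens: Iterable[str]) -> Iterator[str]:
    """Drain the upstream token stream and emit a single token: the count."""
    count = sum(1 for _ in tokens)
    yield str(count)


def analyze_to_count(text: str) -> List[str]:
    """Run the two-stage chain and collect the emitted tokens."""
    return list(count_filter(whitespace_tokenizer(text)))
```

The emitted token is a string, which is why the count would have to be parsed back into a number before sorting on it.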
Sorry, what do you mean?
I like this idea. In that case it'd make sense to build the field in the mapping, maybe like this:
curl -XPUT 'http://localhost:9200/test/test/_mapping?pretty' -d '{
  "test" : {
    "properties": {
      "foo" : {
        "type": "string",
        "store": "yes",
        "write_term_count" : "foo_term_count"
      },
      "foo_term_count" : {
        "type": "integer",
        "store": "yes"
      }
    }
  }
}'
It'd be a pain to have to use the custom analyzer and analyze everything twice, but it'd be worth it if it enables lots of fun features. I'll have a look later today I think.
Record linkage is whenever you want to find similar documents, and word count can be a good hint for that.
This other issue looks similar to what was asked here, although it proposes a separate API for it: #640.
I think someone is confusing Word-Count in a field of a specific document with Term Count of all documents in a field. Not sure who that is, though :)
Indeed, that other issue is a completely different story, I should have read more carefully. Thanks for clarifying that @synhershko
I got this working today. I'll send a pull request for it as soon as it passes all of its tests. GitHub has helpfully created a link to my implementation above for anyone curious. The unit test covers returning the count in the search results, searching for it via a range query, and faceting. It covers counting both single and multi-valued fields, both on the root and inside of an object. For multi-valued fields it writes multiple term counts - it doesn't add them.
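The multi-valued behaviour is probably easiest to see with a toy example. This is only a Python sketch of the rule described above (one count per value, no summing), not the actual implementation; `term_counts` is a hypothetical helper and `str.split` stands in for the field's analyzer.

```python
from typing import Callable, List, Union


def term_counts(
    value: Union[str, List[str]],
    tokenize: Callable[[str], List[str]] = str.split,  # stand-in analyzer (assumption)
) -> List[int]:
    """Return one term count per field value; counts are never added together."""
    values = value if isinstance(value, list) else [value]
    return [len(tokenize(v)) for v in values]
```

So a single-valued field gives a one-element list, and a field with two values gives two separate counts.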
Also, while I think about it I'm pretty sure I did a few things wrong and would love some tips on the right way:
Would anyone else be interested in getting Elasticsearch to return a count of the terms in a field in the search results? If you (like me) need to return a word count of a field then this could be useful to you. I could also get a count of distinct terms, but I'm not super sure who'd use it. I was thinking the API could be something like this:
And it'd return
"foo._term_count" : 6,
in the results. It'd require term_vectors to be stored, but not offsets or positions. Since it'd count the terms on each search result, it'd be similar to highlighting using the FVH but faster, because it does essentially no work other than the term vector scanning. I don't imagine you'd be able to sort by them.
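For anyone wondering what the scan would compute: a term vector without positions or offsets still carries each term's frequency, so the term count is the sum of those frequencies and the distinct count is the number of entries. A minimal Python sketch, assuming the term vector is modelled as a plain dict of term to frequency (the function names and the sample vector are made up for illustration, this isn't Elasticsearch code):

```python
from typing import Dict


def term_count(term_vector: Dict[str, int]) -> int:
    """Total number of terms in the field: sum of per-term frequencies."""
    return sum(term_vector.values())


def distinct_term_count(term_vector: Dict[str, int]) -> int:
    """Number of distinct terms in the field."""
    return len(term_vector)


# Hypothetical term vector for the text "to be or not to be".
sample_vector = {"to": 2, "be": 2, "or": 1, "not": 1}
```

With this sample, `term_count(sample_vector)` is 6 and `distinct_term_count(sample_vector)` is 4.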